Design of Adiabatic MTJ-CMOS Hybrid Circuits
Low-power designs are a necessity given the increasing demand for
battery-operated portable devices. In many such devices, operational
speed is less important than battery life. Logic-in-memory structures using
nano-devices and adiabatic designs are two methods for reducing static and
dynamic power consumption, respectively. The magnetic tunnel junction (MTJ) is
an emerging technology with many advantages when used in logic-in-memory
structures in conjunction with CMOS. In this paper, we introduce a novel
adiabatic hybrid MTJ/CMOS structure, which we use to design AND/NAND, XOR/XNOR,
and 1-bit full adder circuits. We simulate the designs in HSPICE using 32 nm
CMOS technology and compare them with non-adiabatic hybrid MTJ/CMOS circuits.
The proposed adiabatic MTJ/CMOS full adder consumes more than 7 times less
power than the previous MTJ/CMOS full adder.
Evaluating the Impact of Memory System Performance on Software Prefetching and Locality Optimizations
Software prefetching and locality optimizations are two techniques for
overcoming the speed gap between processor and memory known as the
memory wall, as described by Wulf and McKee. This
thesis evaluates the impact of memory trends on the effectiveness
of software prefetching and locality optimizations for three types
of applications: regular scientific codes, irregular scientific
codes, and pointer-chasing codes. For many applications, software
prefetching outperforms locality optimizations when there is
sufficient bandwidth in the underlying memory system, but locality
optimizations outperform software prefetching when the underlying
memory system does not provide sufficient bandwidth. The break-even
point, or equivalently the crossover bandwidth point, occurs at
roughly 2.4 GBytes/sec for 1 GHz processors on today's memory systems, and will increase on future memory systems. This thesis
also studies the interactions between software prefetching and
locality optimizations when applied in concert. Naively combining
the two techniques provides a more robust application performance
in the face of variations in memory bandwidth and/or latency, but
does not yield additional performance gains. In other words, the
combined performance will not exceed the best performance of the two
techniques alone. Also, several algorithms are proposed and
evaluated to better combine software prefetching and locality
optimizations, including an enhanced tiling algorithm, padding for software prefetching, and index prefetching.
(Also UMIACS-TR-2002-72.)
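The locality optimizations discussed above center on tiling. A minimal sketch of loop tiling (illustrative only, not the thesis's enhanced tiling algorithm) looks like this:

```python
def tiled_transpose(a, n, tile=4):
    """Transpose an n x n matrix (list of lists) tile by tile.

    Visiting the matrix in small blocks keeps both the source rows
    and the destination rows resident in cache, which is the locality
    benefit tiling provides (Python hides the actual cache behavior,
    so this only illustrates the loop structure).
    """
    out = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):          # tile row start
        for jj in range(0, n, tile):      # tile column start
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    out[j][i] = a[i][j]
    return out
```

The inner two loops touch only a `tile x tile` block at a time, which is the working-set reduction that makes tiling interact with prefetching in the first place.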
Locality Transformations and Prediction Techniques for Optimizing Multicore Memory Performance
Chip Multiprocessors (CMPs) are here to stay for the foreseeable future. In terms of programmability, what distinguishes these processors from legacy multiprocessors is that sharing among the different cores (processors) is less expensive than it was in the past. Previous research suggested that sharing is a desirable feature to incorporate in new codes. For some programs, more cache leads to more beneficial sharing, since sharing starts to kick in for large on-chip caches. This work tries to answer the question of whether we can (and should) write code differently when the underlying chip microarchitecture is a Chip Multiprocessor. We use a set of three graph benchmarks, each with three input problems varying in size and connectivity, to characterize how we partition the problem space among cores and how that partitioning can happen at multiple levels of the cache. Partitioning at multiple levels leads to better performance both through good utilization of the caches at the lowest level and through increased sharing of data items at the shared cache level (L2 in our case), which can effectively act as prefetching among the compute cores.
The thesis has two thrusts. The first is exploring the design space represented by different parallelization schemes (we devise some tweaks on top of existing techniques) and different graph partitionings (a locality optimization technique suited to graph problems). The combination of parallelization strategy and graph partitioning forms a large and complex space that we characterize using detailed simulation results, to see how much gain we can obtain over a baseline legacy parallelization technique with a partition sized to fit in the L1 cache. We show that the legacy parallelization is not the best alternative in most cases and that other parallelization techniques perform better. We also show that determining the partitioning size is a search problem, and in most cases the best partitioning size is smaller than the baseline partition size.
The second thrust of the thesis is exploring how we can predict the combination of parallelization and partitioning that performs best for any given benchmark and input data set. We use a PIN-based reuse distance profiling tool to build an execution time prediction model that can rank-order the different combinations of parallelization strategies and partitioning sizes. We report the amount of gain we can capture using the PIN prediction relative to what detailed simulation deems best for a given benchmark and input size. In some cases the prediction is 100% accurate, while in other cases it projects worse performance than the baseline case. We report the difference between the best-performing combination in simulation and the PIN-predicted one, along with other statistics, to evaluate how good the predictions are. We show that the PIN prediction method performs much better at predicting the partition size than at predicting the parallelization strategy. Accordingly, the accuracy of the overall scheme can be greatly improved by using only the partitioning size predicted by the PIN scheme and then using a search strategy to find the best parallelization strategy for that partition size.
In this thesis, we use a detailed performance model to scan a large solution space for the best locality-optimization parameters for a set of graph problems. Using the M5 performance simulator, we show gains of up to 20% over a naively picked baseline case. Our prediction scheme can achieve up to 100% of the best performance gains obtained by a search method on real hardware or in performance simulation, without running on the target hardware at all, and up to 48% on average across all of our benchmarks and input sizes.
There are several interesting aspects to this work. We are the first to devise and verify such a performance model against detailed simulation results. We suggest, and quantify, that locality optimization and problem partitioning can increase sharing synergistically to achieve better overall performance. We have shown a new use for coherent reuse distance profiles as a tool that helps program developers and compilers optimize a program's performance.
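The reuse distance profiles that drive the prediction model can be computed with a simple LRU stack simulation. The sketch below is a generic illustration of the metric, not the PIN-based tool described above:

```python
from collections import OrderedDict

def reuse_distances(trace):
    """Compute the LRU reuse distance of each reference in a trace.

    The reuse distance of a reference is the number of *distinct*
    addresses touched since the previous reference to the same address
    (None for first-time references). Histograms of these distances
    are the reuse distance profiles used for locality analysis.
    """
    stack = OrderedDict()   # insertion order = recency, most recent last
    dists = []
    for addr in trace:
        if addr in stack:
            keys = list(stack.keys())
            # distinct addresses touched after the last use of addr
            dists.append(len(keys) - 1 - keys.index(addr))
            stack.pop(addr)
        else:
            dists.append(None)
        stack[addr] = True  # move addr to most-recent position
    return dists
```

For example, in the trace `a b c a`, the second `a` has reuse distance 2 (`b` and `c` intervened), which tells us `a` stays cached only if the cache holds at least three distinct lines.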
Multi-criteria Hardware Trojan Detection: A Reinforcement Learning Approach
Hardware Trojans (HTs) are undesired design or manufacturing modifications
that can severely alter the security and functionality of digital integrated
circuits. HTs can be inserted according to various design criteria, e.g., net
switching activity, observability, controllability, etc. However, to our
knowledge, most HT detection methods are based on only a single criterion,
i.e., net switching activity. This paper proposes a multi-criteria
reinforcement learning (RL) HT detection tool that features a tunable reward
function for different HT detection scenarios. The tool allows for exploring
existing detection strategies and can adapt to new detection scenarios with
minimal effort. We also propose a generic methodology for comparing HT
detection methods fairly. Our preliminary results show an average of 84.2%
successful HT detection on the ISCAS-85 benchmarks.
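A tunable multi-criteria reward of the kind described can be sketched as a weighted combination of per-net features. The specific features, weights, and scaling below are illustrative assumptions, not the authors' reward function:

```python
def ht_reward(net_features, weights):
    """Toy tunable reward for scoring candidate Trojan trigger nets.

    Nets with low switching activity, low observability, and low
    controllability are attractive (hard-to-detect) trigger sites, so
    each criterion rewards *low* feature values. Adjusting the weights
    retargets the same agent to a different detection scenario.
    All features are assumed normalized to [0, 1].
    """
    return (weights["switching"] * (1.0 - net_features["switching"])
            + weights["observability"] * (1.0 - net_features["observability"])
            + weights["controllability"] * (1.0 - net_features["controllability"]))
```

In an RL loop, this score would serve as the per-step reward so the agent learns to rank stealthy nets above highly active, easily observed ones.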
Predicting Expressibility of Parameterized Quantum Circuits using Graph Neural Network
Parameterized Quantum Circuits (PQCs) are essential to quantum machine
learning and optimization algorithms. The expressibility of PQCs, which
measures their ability to represent a wide range of quantum states, is a
critical factor influencing their efficacy in solving quantum problems.
However, the existing technique for computing expressibility relies on
statistically estimating it through classical simulations, which requires many
samples. In this work, we propose a novel method based on Graph Neural Networks
(GNNs) for predicting the expressibility of PQCs. By leveraging the graph-based
representation of PQCs, our GNN-based model captures intricate relationships
between circuit parameters and their resulting expressibility. We train the GNN
model on a comprehensive dataset of PQCs annotated with their expressibility
values. Experimental evaluation on a dataset of four thousand random PQCs and
IBM Qiskit's hardware-efficient ansatz sets demonstrates the superior
performance of our approach, achieving root mean square errors (RMSE) of 0.03
and 0.06, respectively.
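The classical sampling estimator that the GNN model replaces can be sketched for a single qubit. The two-rotation circuit and uniform parameter sampling below are illustrative assumptions, following the common fidelity-histogram definition of expressibility (Sim et al.), where the Haar fidelity distribution for one qubit is uniform on [0, 1]:

```python
import cmath
import math
import random

def ry_rz_state(theta, phi):
    """Single-qubit state RZ(phi) RY(theta) |0>, as amplitude pair."""
    a = math.cos(theta / 2)
    b = math.sin(theta / 2)
    return (cmath.exp(-1j * phi / 2) * a, cmath.exp(1j * phi / 2) * b)

def expressibility(n_pairs=20000, bins=20, seed=0):
    """Estimate expressibility as the KL divergence between the
    circuit's pair-fidelity histogram and the Haar distribution.
    Smaller values mean a more expressive circuit. This is the
    many-sample classical estimator that a learned predictor would
    replace."""
    rng = random.Random(seed)
    counts = [0] * bins
    for _ in range(n_pairs):
        s1 = ry_rz_state(rng.uniform(0, 2 * math.pi), rng.uniform(0, 2 * math.pi))
        s2 = ry_rz_state(rng.uniform(0, 2 * math.pi), rng.uniform(0, 2 * math.pi))
        # fidelity |<psi1|psi2>|^2 between the two sampled states
        f = abs(s1[0].conjugate() * s2[0] + s1[1].conjugate() * s2[1]) ** 2
        counts[min(int(f * bins), bins - 1)] += 1
    kl = 0.0
    for c in counts:
        p = c / n_pairs
        if p > 0:
            kl += p * math.log(p / (1.0 / bins))  # Haar: uniform over bins
    return kl
```

Note the cost the abstract points to: a stable estimate needs thousands of fidelity samples per circuit, which is exactly what motivates predicting the value directly from the circuit's graph.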
Efficient Intra-Rack Resource Disaggregation for HPC Using Co-Packaged DWDM Photonics
The diversity of workload requirements and increasing hardware heterogeneity
in emerging high performance computing (HPC) systems motivate resource
disaggregation. Resource disaggregation allows compute and memory resources to
be allocated individually as required to each workload. However, it is unclear
how to efficiently realize this capability and cost-effectively meet the
stringent bandwidth and latency requirements of HPC applications. To that end,
we describe how modern photonics can be co-designed with modern HPC racks to
implement flexible intra-rack resource disaggregation and fully meet the bit
error rate (BER) requirements and high escape bandwidth of all chip types in modern HPC
racks. Our photonic-based disaggregated rack provides an average application
speedup of 11% (46% maximum) for 25 CPU and 61% for 24 GPU benchmarks compared
to a similar system that instead uses modern electronic switches for
disaggregation. Using observed resource usage from a production system, we
estimate that an iso-performance intra-rack disaggregated HPC system using
photonics would require 4x fewer memory modules and 2x fewer NICs than a
non-disaggregated baseline.
Comment: 15 pages, 12 figures, 4 tables. Published in IEEE Cluster 202
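The source of the memory-module savings can be illustrated with a toy provisioning model. The capacities and demands below are made up for illustration and are not the paper's methodology:

```python
import math

def modules_required(peak_demands_gb, module_cap_gb):
    """Modules needed under per-node provisioning vs. rack-level pooling.

    Per-node provisioning must round each node's peak demand up to a
    whole number of modules; a disaggregated rack lets all nodes draw
    from one shared pool, so only the aggregate demand is rounded up.
    Toy model only: ignores bandwidth, latency, and placement limits.
    """
    per_node = sum(math.ceil(d / module_cap_gb) for d in peak_demands_gb)
    pooled = math.ceil(sum(peak_demands_gb) / module_cap_gb)
    return per_node, pooled
```

Because per-node peaks rarely coincide and each is rounded up separately, pooling the same demand across the rack needs fewer modules, which is the effect the photonic fabric makes practical at rack scale.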
BB-ML: Basic Block Performance Prediction using Machine Learning Techniques
Recent years have seen the adoption of Machine Learning (ML) techniques to
predict the performance of large-scale applications, mostly at a coarse level.
In contrast, we propose to use ML techniques for performance prediction at a
much finer granularity, namely at the Basic Block (BB) level. Basic blocks
are single-entry, single-exit code blocks that compilers use to break a
large code into manageable pieces for analysis. We extrapolate the basic block
execution counts of GPU applications and use them for predicting the
performance for large input sizes from the counts of smaller input sizes. We
train a Poisson Neural Network (PNN) model using random input values as well as
the lowest input values of the application to learn the relationship between
inputs and basic block counts. Experimental results show that the model can
accurately predict the basic block execution counts of 16 GPU benchmarks. We
achieve an accuracy of 93.5% in extrapolating the basic block counts for large
input sets when trained on smaller input sets and an accuracy of 97.7% in
predicting basic block counts on random instances. In a case study, we apply
the ML model to CUDA GPU benchmarks for performance prediction across a
spectrum of applications. We use a variety of metrics for evaluation, including
global memory requests and the active cycles of tensor cores, ALU, and FMA
units. Results demonstrate the model's capability of predicting the performance
of large datasets with an average error rate of 0.85% and 0.17% for global and
shared memory requests, respectively. Additionally, to address the utilization
of the main functional units in Ampere architecture GPUs, we calculate the
active cycles for tensor cores, ALU, FMA, and FP64 units and achieve an average
error of 2.3% and 10.66% for ALU and FMA units while the maximum observed error
across all tested applications and units reaches 18.5%.
Comment: Accepted at the 29th IEEE International Conference on Parallel and
Distributed Systems (ICPADS 2023)
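The extrapolation idea can be sketched with a single log-linear fit standing in for the paper's Poisson Neural Network; the synthetic counts below are assumptions for illustration, not benchmark data:

```python
import math

def fit_loglog(sizes, counts):
    """Least-squares fit of log(count) = w0 + w1 * log(size).

    A toy stand-in for the PNN: fit a log-linear model to basic-block
    execution counts observed at small input sizes, then extrapolate
    to larger inputs. Assumes counts grow polynomially in input size.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    w0 = my - w1 * mx
    return w0, w1

def predict_count(w0, w1, size):
    """Extrapolate the basic-block execution count for a larger input."""
    return math.exp(w0 + w1 * math.log(size))
```

Training on counts from small inputs and predicting for inputs an order of magnitude larger mirrors the extrapolation setup the abstract describes, with the neural model replaced by a one-parameter slope.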